xkcd commentary on p-values
There are many packages to load. Here is a summary table of what each package does.
| Package | Purpose |
|---|---|
| car | Anova() function to extract type III sums of squares |
| lme4 | mixed models |
| nlme | mixed models, non-linear models, alternative covariance structures |
| emmeans | for extracting least squares means and contrasts |
| lmer test | improved summary functions of lmer objects |
| dplyr | data organization |
| forcats | for managing categorical data |
| agridat | has many agricultural data sets |
| agricolae | has options for many common agricultural experimental designs |
library(car)
library(lme4)
library(nlme)
library(emmeans) #note that this package also requires "multcompView" be installed (but it does not need to be loaded separately)
library(lmerTest)
library(dplyr)
library(forcats)
library(agridat)
(we will use this later, but it is not actually related to ANOVA)
na_func <- function(df) {
apply(df, 2, function(x) sum(is.na(x)))
}
ANOVA in R is a unfortunately a bit complicated. Unlike SAS, ANOVA functions in R lack a consistent structure, consistent output and the accessory packages for ANOVA display a patchwork of compatibility. The result is that it is easy to misspecify a model or make other mistakes. The information below is intended to serve as a guide through the R ANOVA wilderness.
There are some consistent features across ANOVA methods in R. Formula notation is often used in the R syntax for ANOVA functions. It looks like this: $Y ~ X, where Y is the dependent variable (the response) and X is/are the independent variable(s) (e.g. the experimental treatments).
my_formula <- formula(Y ~ treatment1 + treatment2)
class(my_formula)
## [1] "formula"
my_formula
## Y ~ treatment1 + treatment2
Often the independent variables (i,e, the treatments or the x variables) are expected to be factors, another type of R object:
my_var <- c(rep("low",5), rep("high", 5))
class(my_var) #check what variable type it is
## [1] "character"
# since "my_var" is not type factor, it needs to be converted to a
my_factor <- as.factor(my_var) # convert character variable to a factor
class(my_factor) # check variable type again to confirm
## [1] "factor"
Sometimes, there is a need to alter the order of treatment levels (well, how R sees those levels)
levels(my_factor) # see how R has ordered the factor levels and which levels are present
## [1] "high" "low"
# the default is alphabetical ordering
# the orders can be manually set:
my_factor <- factor(my_factor, levels = c("low", "high"))
Knowing the level order is important because in the implementation of ANOVA in R, the first level is treated as the reference level.
More on formulas:
The formula first shown, Y ~ treatment1 + treatment2, includes main effects only. Other formula notation includes the symbols : and *.
formula(Y ~ treatment1:treatment2) # interaction only
## Y ~ treatment1:treatment2
formula(Y ~ treatment1*treatment2) # interaction plus main effects
## Y ~ treatment1 * treatment2
These two formulas are equivalent:
formula(Y ~ treatment1 + treatment2 + treatment1:treatment2)
formula(Y ~ treatment1*treatment2)
Perhaps you can see from these examples that formulas are a really just a collections of characters (that is, a string) and exist independent of any data set. Later, we will need to link these formulas to a data set to actually construct a linear model and conduct statistical analysis.
First, load the data set “warpbreaks” (a data set from base R). This is an old data set with variables for wool type (A and B) and tension on the loom (L, M or H). The response variable is “breaks”, the number of times the wool thread breaks on industrial looms.
I always like to have a quick look at the data before running any statistical tests. So, here we go:
data(warpbreaks)
na_func(warpbreaks)
## breaks wool tension
## 0 0 0
str(warpbreaks)
## 'data.frame': 54 obs. of 3 variables:
## $ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
## $ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
## $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
table(warpbreaks$wool, warpbreaks$tension)
##
## L M H
## A 9 9 9
## B 9 9 9
hist(warpbreaks$breaks, col = "gold")
boxplot(breaks ~ wool, data = warpbreaks, col = "orangered")
boxplot(breaks ~ tension, data = warpbreaks, col = "chartreuse")
This data set has 2 treatments. We don’t know if there is an interaction between the variables, yet. A good start is to run a linear model using lm() function, the linear regression function. As a reminder, ANOVA is a special case of the linear regression model where the predictors (the experimental treatments) are categories rather than a continuous variable.
# run standard linear model for main effects only
lm.mod1 <- lm(breaks ~ wool + tension, data = warpbreaks)
# extract type III sums of squares from that model
Anova(lm.mod1, type = "III")
## Anova Table (Type III tests)
##
## Response: breaks
## Sum Sq Df F value Pr(>F)
## (Intercept) 20827.0 1 154.3226 < 2.2e-16 ***
## wool 450.7 1 3.3393 0.073614 .
## tension 2034.3 2 7.5367 0.001378 **
## Residuals 6747.9 50
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# run a linear model with main effects and interactions
lm.mod2 <- lm(breaks ~ wool*tension, data = warpbreaks)
# ...and type III sums of squares
Anova(lm.mod2, type = "III")
## Anova Table (Type III tests)
##
## Response: breaks
## Sum Sq Df F value Pr(>F)
## (Intercept) 17866.8 1 149.2757 2.426e-16 ***
## wool 1200.5 1 10.0301 0.0026768 **
## tension 2468.5 2 10.3121 0.0001881 ***
## wool:tension 1002.8 2 4.1891 0.0210442 *
## Residuals 5745.1 48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# FYI:
# type II sums of squares:
# Anova(lm.mod2, type = "II")
# type I sums of squares:
# anova(lm.mod2)
A few comments on types of sums of squares:
As a reminder, the type of sums of squares used in statistical tests can impact the results and subsequent interpretation. Type I, sums of squares tests for statistical significance by adding one variable to the model at time (and hence is also called “sequential”). It is very dependent on the order variables are added to the model and hence is often not the best choice for many agricultural experiment. Type III sums of squares (also called “partial” or “marginal”) evaluates the statistical significance of variable or interaction, assuming that the other variables are in the model. This is a decent default option. The last option is Type II sums of squares, which is the best option when you are sure there are no interactions between variables.
# conduct an F test comparing the models
anova(lm.mod1, lm.mod2)
## Analysis of Variance Table
##
## Model 1: breaks ~ wool + tension
## Model 2: breaks ~ wool * tension
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 50 6747.9
## 2 48 5745.1 2 1002.8 4.1891 0.02104 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# also, consider doing a stepwise approach for finding the best model:
step(lm.mod2)
## Start: AIC=264.02
## breaks ~ wool * tension
##
## Df Sum of Sq RSS AIC
## <none> 5745.1 264.02
## - wool:tension 2 1002.8 6747.9 268.71
##
## Call:
## lm(formula = breaks ~ wool * tension, data = warpbreaks)
##
## Coefficients:
## (Intercept) woolB tensionM tensionH
## 44.56 -16.33 -20.56 -20.00
## woolB:tensionM woolB:tensionH
## 21.11 10.56
plot(lm.mod2) #this will produce 4 plots of residuals
shapiro.test(resid(lm.mod2)) #standard shapiro-wilk test. interpret with a grain of salt
##
## Shapiro-Wilk normality test
##
## data: resid(lm.mod2)
## W = 0.98686, p-value = 0.8162
# extract least squares means
(lsm <- emmeans(lm.mod2, ~ tension))
## NOTE: Results may be misleading due to involvement in interactions
## tension emmean SE df lower.CL upper.CL
## L 36.38889 2.57865 48 31.20417 41.57361
## M 26.38889 2.57865 48 21.20417 31.57361
## H 21.66667 2.57865 48 16.48194 26.85139
##
## Results are averaged over the levels of: wool
## Confidence level used: 0.95
# make all pairwise comparisons within each factor
contrast(lsm, "pairwise")
## contrast estimate SE df t.ratio p.value
## L - M 10.000000 3.646761 48 2.742 0.0229
## L - H 14.722222 3.646761 48 4.037 0.0006
## M - H 4.722222 3.646761 48 1.295 0.4049
##
## Results are averaged over the levels of: wool
## P value adjustment: tukey method for comparing a family of 3 estimates
# compare low tensions versus Medium and Medium
levels(warpbreaks$tension)
## [1] "L" "M" "H"
cList <- list(LvMH = c(1, -0.5, -0.5))
# check that contrast sums to zero
lapply(cList, sum)
## $LvMH
## [1] 0
# perform contrast
summary(contrast(lsm, cList, adjust = "none"))
## contrast estimate SE df t.ratio p.value
## LvMH 12.36111 3.158188 48 3.914 0.0003
##
## Results are averaged over the levels of: wool
This example uses rapeseed yield data from multiple locations, years and cultivars. Within a single location or year, the replication is often balanced.
Load Data and examine:
data(shafii.rapeseed)
rapeseed1987 <- shafii.rapeseed %>% filter(year == 87) %>%
mutate(block = fct_drop(rep), Cv = fct_drop(gen), loc = fct_drop(loc))
str(rapeseed1987)
## 'data.frame': 216 obs. of 7 variables:
## $ year : int 87 87 87 87 87 87 87 87 87 87 ...
## $ loc : Factor w/ 9 levels "GGA","ID","MT",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ rep : Factor w/ 4 levels "R1","R2","R3",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ gen : Factor w/ 6 levels "Bienvenu","Bridger",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ yield: num 961 1329 1781 1698 1605 ...
## $ block: Factor w/ 4 levels "R1","R2","R3",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Cv : Factor w/ 6 levels "Bienvenu","Bridger",..: 1 1 1 1 2 2 2 2 3 3 ...
na_func(rapeseed1987)
## year loc rep gen yield block Cv
## 0 0 0 0 0 0 0
table(rapeseed1987$Cv, rapeseed1987$loc) #experiment has 1 rep per block
##
## GGA ID MT NC OR SC TGA TX WA
## Bienvenu 4 4 4 4 4 4 4 4 4
## Bridger 4 4 4 4 4 4 4 4 4
## Cascade 4 4 4 4 4 4 4 4 4
## Dwarf 4 4 4 4 4 4 4 4 4
## Glacier 4 4 4 4 4 4 4 4 4
## Jet 4 4 4 4 4 4 4 4 4
hist(rapeseed1987$yield, col = "gold")
boxplot(yield ~ Cv, data = rapeseed1987, col = "orangered")
boxplot(yield ~ loc, data = rapeseed1987, col = "chartreuse")
Analyse experiment:
# for this example, the analysis will only be done for a single year
# block is nested within location
# if each block had a unique name, 'Error(block)' would suffce
shaf.aov <- aov(yield ~ Cv*loc + Error(block) + Cv*loc, data = rapeseed1987)
summary(shaf.aov)
##
## Error: block
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 3 336565 112188
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## Cv 5 3203992 640798 2.645 0.025111 *
## loc 8 318197192 39774649 164.165 < 2e-16 ***
## Cv:loc 40 22707425 567686 2.343 0.000103 ***
## Residuals 159 38523267 242285
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(shaf.aov, ~ Cv | loc)
## Note: re-fitting model with sum-to-zero contrasts
## loc = GGA:
## Cv emmean SE df lower.CL upper.CL
## Bienvenu 1442.317 244.8854 161.14 958.719 1925.916
## Bridger 1363.295 244.8854 161.14 879.696 1846.894
## Cascade 1504.527 244.8854 161.14 1020.929 1988.126
## Dwarf 1294.920 244.8854 161.14 811.321 1778.519
## Glacier 1680.510 244.8854 161.14 1196.911 2164.109
## Jet 1090.917 244.8854 161.14 607.319 1574.516
##
## loc = ID:
## Cv emmean SE df lower.CL upper.CL
## Bienvenu 1242.155 244.8854 161.14 758.556 1725.754
## Bridger 946.671 244.8854 161.14 463.072 1430.269
## Cascade 773.284 244.8854 161.14 289.685 1256.883
## Dwarf 931.553 244.8854 161.14 447.954 1415.151
## Glacier 1111.030 244.8854 161.14 627.431 1594.629
## Jet 1064.030 244.8854 161.14 580.431 1547.629
##
## loc = MT:
## Cv emmean SE df lower.CL upper.CL
## Bienvenu 2616.000 244.8854 161.14 2132.401 3099.599
## Bridger 2828.250 244.8854 161.14 2344.651 3311.849
## Cascade 2916.250 244.8854 161.14 2432.651 3399.849
## Dwarf 3451.750 244.8854 161.14 2968.151 3935.349
## Glacier 3306.750 244.8854 161.14 2823.151 3790.349
## Jet 3660.500 244.8854 161.14 3176.901 4144.099
##
## loc = NC:
## Cv emmean SE df lower.CL upper.CL
## Bienvenu 1000.962 244.8854 161.14 517.363 1484.560
## Bridger 1064.289 244.8854 161.14 580.691 1547.888
## Cascade 745.398 244.8854 161.14 261.800 1228.997
## Dwarf 1013.852 244.8854 161.14 530.254 1497.451
## Glacier 1229.343 244.8854 161.14 745.744 1712.941
## Jet 1673.780 244.8854 161.14 1190.181 2157.379
##
## loc = OR:
## Cv emmean SE df lower.CL upper.CL
## Bienvenu 4555.895 244.8854 161.14 4072.296 5039.494
## Bridger 2529.867 244.8854 161.14 2046.269 3013.466
## Cascade 3335.792 244.8854 161.14 2852.194 3819.391
## Dwarf 3931.832 244.8854 161.14 3448.234 4415.431
## Glacier 4185.157 244.8854 161.14 3701.559 4668.756
## Jet 3219.500 244.8854 161.14 2735.901 3703.099
##
## loc = SC:
## Cv emmean SE df lower.CL upper.CL
## Bienvenu 2499.750 244.8854 161.14 2016.151 2983.349
## Bridger 2705.000 244.8854 161.14 2221.401 3188.599
## Cascade 2118.750 244.8854 161.14 1635.151 2602.349
## Dwarf 1893.750 244.8854 161.14 1410.151 2377.349
## Glacier 2717.250 244.8854 161.14 2233.651 3200.849
## Jet 2832.750 244.8854 161.14 2349.151 3316.349
##
## loc = TGA:
## Cv emmean SE df lower.CL upper.CL
## Bienvenu 1257.778 244.8854 161.14 774.180 1741.377
## Bridger 1867.695 244.8854 161.14 1384.096 2351.294
## Cascade 1707.827 244.8854 161.14 1224.229 2191.426
## Dwarf 872.602 244.8854 161.14 389.004 1356.201
## Glacier 1453.235 244.8854 161.14 969.636 1936.834
## Jet 954.005 244.8854 161.14 470.406 1437.603
##
## loc = TX:
## Cv emmean SE df lower.CL upper.CL
## Bienvenu 838.000 244.8854 161.14 354.401 1321.599
## Bridger 1069.000 244.8854 161.14 585.401 1552.599
## Cascade 734.750 244.8854 161.14 251.151 1218.349
## Dwarf 988.250 244.8854 161.14 504.651 1471.849
## Glacier 951.750 244.8854 161.14 468.151 1435.349
## Jet 1408.250 244.8854 161.14 924.651 1891.849
##
## loc = WA:
## Cv emmean SE df lower.CL upper.CL
## Bienvenu 4375.000 244.8854 161.14 3891.401 4858.599
## Bridger 4603.750 244.8854 161.14 4120.151 5087.349
## Cascade 4464.250 244.8854 161.14 3980.651 4947.849
## Dwarf 3974.000 244.8854 161.14 3490.401 4457.599
## Glacier 4740.000 244.8854 161.14 4256.401 5223.599
## Jet 4344.250 244.8854 161.14 3860.651 4827.849
##
## Confidence level used: 0.95
(models with random and fixed effects)
Random effects are often those treatments levels drawn from a large population of possible treatment levels and there is interest in understanding the distribution and variance of that population. This in contrast to fixed effects, where the inferences are restricted to the treatment levels tested.
Blocking factors and Year are often considered random factors because a researcher is not interested in particular years or a blocking factor. When there is unbalanced replication, the variance components should be estimated with maximum likelihood or REML, which implies use of the packages “lmer” and/or “nlme”.
The “shafii.rapeseed” data set will be used for this section.
Analyse experiment using a mixed model:
This uses the function lme() from the package “nlme”. Functionally, it is very similar to calling lme4::lmer(). The degrees of freedom are different (lmer is using Satterthwaite’s approximation), but the p-values are the same.
# turn year into the factor "Year"
shafii.rapeseed$Year <- as.factor(shafii.rapeseed$year)
# create a blocking variable that is unique for each location-by-year combination
# so R doesn't conflate "R1" from one location/year with another location/year
shafii.rapeseed$Rep <- as.factor(paste(shafii.rapeseed$loc, shafii.rapeseed$year, shafii.rapeseed$rep, sep = "_"))
shaf.lme <- lme(fixed = yield ~ gen*loc + Year,
random = ~ 1|Rep,
data = shafii.rapeseed, method = "REML")
# view sum of squares table
# when anova() is called for an lme object, the function called is actually anova.lme()
anova(shaf.lme, type = "marginal") # "marginal" is equivalent to type III sums of squares
## numDF denDF F-value p-value
## (Intercept) 1 470 16.204597 0.0001
## gen 5 470 1.092341 0.3637
## loc 13 92 13.074492 <.0001
## Year 2 92 2.035054 0.1365
## gen:loc 65 470 2.575753 <.0001
# FYI: use "anova(model.lme)" for type I sums of squares
# lmer notation
shaf.lmer <- lmer(yield ~ gen*loc + Year + (1|Rep),
data = shafii.rapeseed, REML = T)
anova(shaf.lmer, type = "marginal")
## Marginal Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## gen 1860586 372117 5 470.00 1.0923 0.3637
## loc 57901483 4453960 13 159.37 13.0745 < 2.2e-16 ***
## Year 1386524 693262 2 92.00 2.0351 0.1365
## gen:loc 57034691 877457 65 470.00 2.5758 5.499e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(shaf.lme)
qqnorm(shaf.lme, abline = c(0, 1))
# for cultivar
(lme.means.cv <- emmeans(shaf.lme, "gen"))
## NOTE: Results may be misleading due to involvement in interactions
## gen emmean SE df lower.CL upper.CL
## Bienvenu 2432.485 111.6229 92 2210.792 2654.178
## Bridger 2313.914 111.6229 92 2092.221 2535.607
## Cascade 2184.142 111.6229 92 1962.449 2405.835
## Dwarf 2308.377 111.6229 92 2086.684 2530.070
## Glacier 2463.481 111.6229 92 2241.788 2685.173
## Jet 2303.783 111.6229 92 2082.091 2525.476
##
## Results are averaged over the levels of: loc, Year
## Degrees-of-freedom method: containment
## Confidence level used: 0.95
# for location
(lme.means.loc <- emmeans(shaf.lme, "loc"))
## NOTE: Results may be misleading due to involvement in interactions
## loc emmean SE df lower.CL upper.CL
## GGA 1682.169 328.5723 92 1029.596 2334.742
## ID 4217.051 261.3686 92 3697.950 4736.151
## KS 1119.594 476.3386 92 173.544 2065.644
## MS 2203.887 476.3386 92 1257.837 3149.936
## MT 3338.808 473.7707 92 2397.858 4279.758
## NC 1328.244 328.5723 92 675.671 1980.817
## NY 3138.795 476.3386 92 2192.746 4084.845
## OR 3292.090 328.5723 92 2639.517 3944.663
## SC 1818.962 261.3686 92 1299.862 2338.063
## TGA 1028.288 261.3686 92 509.187 1547.388
## TN 2543.467 476.3386 92 1597.418 3489.517
## TX 826.748 328.5723 92 174.175 1479.321
## VA 2281.657 327.6428 92 1630.930 2932.384
## WA 3861.333 261.3686 92 3342.233 4380.434
##
## Results are averaged over the levels of: gen, Year
## Degrees-of-freedom method: containment
## Confidence level used: 0.95
# for cultivar means within each location
lme.means.int <- emmeans(shaf.lme, ~ gen | loc + Year)
# this code would produce location means within each cultivar
# emmeans(model.lme, ~ loc | gen))
# all possible combinations exist!
# also:
# emmeans(model.lme, ~ loc | gen)) is the same as emmeans(model.lme, ~ gen | loc))
# all pairwise
pairs(lme.means.cv)
## contrast estimate SE df t.ratio p.value
## Bienvenu - Bridger 118.57068 87.61532 470 1.353 0.7548
## Bienvenu - Cascade 248.34262 87.61532 470 2.834 0.0539
## Bienvenu - Dwarf 124.10773 87.61532 470 1.417 0.7170
## Bienvenu - Glacier -30.99567 87.61532 470 -0.354 0.9993
## Bienvenu - Jet 128.70157 87.61532 470 1.469 0.6843
## Bridger - Cascade 129.77193 87.61532 470 1.481 0.6765
## Bridger - Dwarf 5.53704 87.61532 470 0.063 1.0000
## Bridger - Glacier -149.56635 87.61532 470 -1.707 0.5277
## Bridger - Jet 10.13088 87.61532 470 0.116 1.0000
## Cascade - Dwarf -124.23489 87.61532 470 -1.418 0.7161
## Cascade - Glacier -279.33829 87.61532 470 -3.188 0.0190
## Cascade - Jet -119.64105 87.61532 470 -1.366 0.7477
## Dwarf - Glacier -155.10340 87.61532 470 -1.770 0.4861
## Dwarf - Jet 4.59384 87.61532 470 0.052 1.0000
## Glacier - Jet 159.69724 87.61532 470 1.823 0.4521
##
## Results are averaged over the levels of: loc, Year
## P value adjustment: tukey method for comparing a family of 6 estimates
# HSD-like comparisons:
CLD(lme.means.cv)
## gen emmean SE df lower.CL upper.CL .group
## Cascade 2184.142 111.6229 92 1962.449 2405.835 1
## Jet 2303.783 111.6229 92 2082.091 2525.476 12
## Dwarf 2308.377 111.6229 92 2086.684 2530.070 12
## Bridger 2313.914 111.6229 92 2092.221 2535.607 12
## Bienvenu 2432.485 111.6229 92 2210.792 2654.178 12
## Glacier 2463.481 111.6229 92 2241.788 2685.173 2
##
## Results are averaged over the levels of: loc, Year
## Degrees-of-freedom method: containment
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 6 estimates
## significance level used: alpha = 0.05
CLD(lme.means.loc)
## loc emmean SE df lower.CL upper.CL .group
## TX 826.748 328.5723 92 174.175 1479.321 1
## TGA 1028.288 261.3686 92 509.187 1547.388 1
## KS 1119.594 476.3386 92 173.544 2065.644 123
## NC 1328.244 328.5723 92 675.671 1980.817 12
## GGA 1682.169 328.5723 92 1029.596 2334.742 1234
## SC 1818.962 261.3686 92 1299.862 2338.063 123
## MS 2203.887 476.3386 92 1257.837 3149.936 12345
## VA 2281.657 327.6428 92 1630.930 2932.384 1234
## TN 2543.467 476.3386 92 1597.418 3489.517 123456
## NY 3138.795 476.3386 92 2192.746 4084.845 23456
## OR 3292.090 328.5723 92 2639.517 3944.663 456
## MT 3338.808 473.7707 92 2397.858 4279.758 3456
## WA 3861.333 261.3686 92 3342.233 4380.434 56
## ID 4217.051 261.3686 92 3697.950 4736.151 6
##
## Results are averaged over the levels of: gen, Year
## Degrees-of-freedom method: containment
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 14 estimates
## significance level used: alpha = 0.05
# CLD(lme.means.int) # not actually run because it results in copious output
# plot same results
plot(lme.means.cv, comparison = T)
plot(lme.means.loc, comparison = T, horizontal = F) # rotate plots to vertical position
## Comparison discrepancy in group 1, GGA - OR:
## Target overlap = 0.0083, overlap on graph = -0.0111
# blue bars = lsmeans confidence 95% confidence intervals
# red arrows. pairwise differnces (overlapping arrows = not significantly different)
Interaction plots can also be done:
(but, it gets unwieldy)
plot(lme.means.int, comparison = T, adjust = "tukey")
# compare to a control, e.g. "Bridger"
levels(shafii.rapeseed$gen)
## [1] "Bienvenu" "Bridger" "Cascade" "Dwarf" "Glacier" "Jet"
# Bridger is listed in position 2 of the factor 'shafii.rapeseed$gen'
# so '2' is set as the reference level in the following contrast statement:
# "trt.vs.ctrlk" (treatment versus control treatment k) is a specific option to compare all treatment levels to a user-defined level
# by default, it will use the last level as the reference level
contrast(lme.means.cv, "trt.vs.ctrlk", ref = 2)
## contrast estimate SE df t.ratio p.value
## Bienvenu - Bridger 118.57068 87.61532 470 1.353 0.5118
## Cascade - Bridger -129.77193 87.61532 470 -1.481 0.4315
## Dwarf - Bridger -5.53704 87.61532 470 -0.063 0.9998
## Glacier - Bridger 149.56635 87.61532 470 1.707 0.3034
## Jet - Bridger -10.13088 87.61532 470 -0.116 0.9990
##
## Results are averaged over the levels of: loc, Year
## P value adjustment: dunnettx method for 5 tests
Search ?contrast.emmGrid to see full list of options for preset contrasts.
# example: contrast Western locations versus Southern locations
# first, find out what levels are present
unique(shafii.rapeseed$loc)
## [1] GGA ID KS MS MT NC NY OR SC TGA TN TX VA WA
## Levels: GGA ID KS MS MT NC NY OR SC TGA TN TX VA WA
# next create a contrast list
# this is a list of coefficients as long your list of treatement levels
# indicating what coefficients to give each treatment level
# in this example, levels "ID", "MT", "OR", and "WA" are contrasted versus
# "NC", "SC", "MS", "TN", "TX" and "VA"
cList <- list(West_V_South = c(0, 1/4, 0, -1/6, 1/4, -1/6, 0, 1/4, -1/6, 0, -1/6, -1/6, -1/6, 1/4))
# check that each contrast sums to zero:
lapply(cList, sum)
## $West_V_South
## [1] 5.551115e-17
lme.means.loc2 <- emmeans(shaf.lme, "loc", contr = cList)
## NOTE: Results may be misleading due to involvement in interactions
summary(lme.means.loc2)
## $emmeans
## loc emmean SE df lower.CL upper.CL
## GGA 1682.169 328.5723 92 1029.596 2334.742
## ID 4217.051 261.3686 92 3697.950 4736.151
## KS 1119.594 476.3386 92 173.544 2065.644
## MS 2203.887 476.3386 92 1257.837 3149.936
## MT 3338.808 473.7707 92 2397.858 4279.758
## NC 1328.244 328.5723 92 675.671 1980.817
## NY 3138.795 476.3386 92 2192.746 4084.845
## OR 3292.090 328.5723 92 2639.517 3944.663
## SC 1818.962 261.3686 92 1299.862 2338.063
## TGA 1028.288 261.3686 92 509.187 1547.388
## TN 2543.467 476.3386 92 1597.418 3489.517
## TX 826.748 328.5723 92 174.175 1479.321
## VA 2281.657 327.6428 92 1630.930 2932.384
## WA 3861.333 261.3686 92 3342.233 4380.434
##
## Results are averaged over the levels of: gen, Year
## Degrees-of-freedom method: containment
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## West_V_South 1843.493 233.0729 92 7.910 <.0001
##
## Results are averaged over the levels of: gen, Year
# same contrast can also be done within each level of Cv:
emmeans(shaf.lme, ~ loc | gen, contr = cList)
## $emmeans
## gen = Bienvenu:
## loc emmean SE df lower.CL upper.CL
## GGA 1784.523 378.7416 92 1032.309 2536.736
## ID 4742.446 303.2664 92 4140.133 5344.759
## KS 1178.552 545.7741 92 94.598 2262.507
## MS 2455.425 545.7741 92 1371.470 3539.379
## MT 2824.891 543.5344 92 1745.385 3904.398
## NC 1329.577 378.7416 92 577.363 2081.790
## NY 2933.641 545.7741 92 1849.687 4017.596
## OR 4118.193 378.7416 92 3365.980 4870.407
## SC 1843.733 303.2664 92 1241.419 2446.046
## TGA 893.306 303.2664 92 290.992 1495.619
## TN 2964.549 545.7741 92 1880.594 4048.503
## TX 919.252 378.7416 92 167.038 1671.466
## VA 2123.952 377.9355 92 1373.339 2874.564
## WA 3942.750 303.2664 92 3340.437 4545.063
##
## gen = Bridger:
## loc emmean SE df lower.CL upper.CL
## GGA 1470.389 378.7416 92 718.175 2222.602
## ID 3591.466 303.2664 92 2989.153 4193.779
## KS 1091.302 545.7741 92 7.348 2175.257
## MS 2477.845 545.7741 92 1393.890 3561.799
## MT 3037.141 543.5344 92 1957.635 4116.648
## NC 1479.492 378.7416 92 727.279 2231.706
## NY 3129.554 545.7741 92 2045.599 4213.508
## OR 2564.206 378.7416 92 1811.992 3316.419
## SC 2281.892 303.2664 92 1679.579 2884.206
## TGA 1602.976 303.2664 92 1000.663 2205.290
## TN 2485.286 545.7741 92 1401.332 3569.241
## TX 851.291 378.7416 92 99.078 1603.505
## VA 2397.374 377.9355 92 1646.762 3147.987
## WA 3934.583 303.2664 92 3332.270 4536.897
##
## gen = Cascade:
## loc emmean SE df lower.CL upper.CL
## GGA 1758.460 378.7416 92 1006.247 2510.674
## ID 4081.356 303.2664 92 3479.042 4683.669
## KS 890.552 545.7741 92 -193.402 1974.507
## MS 1597.995 545.7741 92 514.040 2681.949
## MT 3125.141 543.5344 92 2045.635 4204.648
## NC 1061.822 378.7416 92 309.608 1814.035
## NY 2586.169 545.7741 92 1502.214 3670.123
## OR 2806.038 378.7416 92 2053.825 3558.252
## SC 1982.013 303.2664 92 1379.699 2584.326
## TGA 1492.147 303.2664 92 889.833 2094.460
## TN 2006.321 545.7741 92 922.367 3090.276
## TX 795.807 378.7416 92 43.593 1548.020
## VA 2191.172 377.9355 92 1440.559 2941.784
## WA 4203.000 303.2664 92 3600.687 4805.313
##
## gen = Dwarf:
## loc emmean SE df lower.CL upper.CL
## GGA 1537.924 378.7416 92 785.710 2290.137
## ID 4326.120 303.2664 92 3723.807 4928.433
## KS 1207.802 545.7741 92 123.848 2291.757
## MS 1965.647 545.7741 92 881.693 3049.602
## MT 3660.641 543.5344 92 2581.135 4740.148
## NC 1320.748 378.7416 92 568.534 2072.961
## NY 3645.214 545.7741 92 2561.259 4729.168
## OR 3593.612 378.7416 92 2841.398 4345.825
## SC 1292.411 303.2664 92 690.098 1894.724
## TGA 450.502 303.2664 92 -151.811 1052.815
## TN 2687.529 545.7741 92 1603.574 3771.483
## TX 653.570 378.7416 92 -98.644 1405.783
## VA 2249.727 377.9355 92 1499.115 3000.340
## WA 3725.833 303.2664 92 3123.520 4328.147
##
## gen = Glacier:
## loc emmean SE df lower.CL upper.CL
## GGA 2030.559 378.7416 92 1278.345 2782.772
## ID 4298.921 303.2664 92 3696.608 4901.234
## KS 1267.802 545.7741 92 183.848 2351.757
## MS 2860.772 545.7741 92 1776.818 3944.727
## MT 3515.641 543.5344 92 2436.135 4595.148
## NC 1452.032 378.7416 92 699.818 2204.245
## NY 3301.441 545.7741 92 2217.487 4385.396
## OR 3471.574 378.7416 92 2719.361 4223.788
## SC 2025.279 303.2664 92 1422.966 2627.592
## TGA 1109.218 303.2664 92 506.905 1711.531
## TN 2264.536 545.7741 92 1180.582 3348.491
## TX 720.368 378.7416 92 -31.846 1472.581
## VA 2363.251 377.9355 92 1612.638 3113.863
## WA 3807.333 303.2664 92 3205.020 4409.647
##
## gen = Jet:
## loc emmean SE df lower.CL upper.CL
## GGA 1511.161 378.7416 92 758.948 2263.375
## ID 4261.996 303.2664 92 3659.683 4864.309
## KS 1081.552 545.7741 92 -2.402 2165.507
## MS 1865.635 545.7741 92 781.680 2949.589
## MT 3869.391 543.5344 92 2789.885 4948.898
## NC 1325.793 378.7416 92 573.579 2078.006
## NY 3236.754 545.7741 92 2152.799 4320.708
## OR 3198.914 378.7416 92 2446.701 3951.128
## SC 1488.445 303.2664 92 886.132 2090.758
## TGA 621.578 303.2664 92 19.265 1223.891
## TN 2852.581 545.7741 92 1768.627 3936.536
## TX 1020.200 378.7416 92 267.987 1772.414
## VA 2364.466 377.9355 92 1613.853 3115.078
## WA 3554.500 303.2664 92 2952.187 4156.813
##
## Results are averaged over the levels of: Year
## Degrees-of-freedom method: containment
## Confidence level used: 0.95
##
## $contrasts
## gen = Bienvenu:
## contrast estimate SE df t.ratio p.value
## West_V_South 1967.656 267.3775 92 7.359 <.0001
##
## gen = Bridger:
## contrast estimate SE df t.ratio p.value
## West_V_South 1286.319 267.3775 92 4.811 <.0001
##
## gen = Cascade:
## contrast estimate SE df t.ratio p.value
## West_V_South 1948.029 267.3775 92 7.286 <.0001
##
## gen = Dwarf:
## contrast estimate SE df t.ratio p.value
## West_V_South 2131.613 267.3775 92 7.972 <.0001
##
## gen = Glacier:
## contrast estimate SE df t.ratio p.value
## West_V_South 1825.661 267.3775 92 6.828 <.0001
##
## gen = Jet:
## contrast estimate SE df t.ratio p.value
## West_V_South 1901.680 267.3775 92 7.112 <.0001
##
## Results are averaged over the levels of: Year
To conduct custom contrasts on a another variable (e.g. “loc”), that needs its own cList and emmeans statement.
(analysis of covariance) From a R programming perspective, this is no different than running a standard linear model. A data set from agridat, “theobald covariate” comparing corn silage yields across multiple years, locations and cultivars. The data set includes a covariate, “chu” (corn heat units, a bit like growing degree days).
Load data and examine:
data(theobald.covariate)
str(theobald.covariate)
## 'data.frame': 256 obs. of 5 variables:
## $ year : int 1990 1990 1990 1990 1990 1991 1991 1991 1991 1991 ...
## $ env : Factor w/ 7 levels "E1","E2","E3",..: 1 2 3 4 7 1 2 3 4 5 ...
## $ gen : Factor w/ 10 levels "G01","G02","G03",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ yield: num 6.27 5.57 8.45 7.35 6.5 6.71 5.59 8.36 7.25 8.09 ...
## $ chu : num 2.57 2.53 2.72 2.72 2.48 2.44 2.55 2.75 2.75 2.61 ...
na_func(theobald.covariate)
## year env gen yield chu
## 0 0 0 0 0
Exploratory plots:
# distributions of continuous variables
hist(theobald.covariate$yield, col = "gold")
hist(theobald.covariate$chu, col = "gray70")
# relationship between reponse variable and covariate:
with(theobald.covariate, plot(chu, yield))
length(unique(theobald.covariate$chu))
## [1] 21
# the usual boxplots:
boxplot(yield ~ env, data = theobald.covariate, col = "orangered")
boxplot(yield ~ year, data = theobald.covariate, col = "chartreuse")
boxplot(yield ~ gen, data = theobald.covariate, col = "darkcyan")
Check level of replication:
theobald.covariate$Year <- as.factor(theobald.covariate$year)
replications(yield ~ Year + env + gen, data = theobald.covariate)
## $Year
## Year
## 1990 1991 1992 1993 1994
## 40 63 60 45 48
##
## $env
## env
## E1 E2 E3 E4 E5 E6 E7
## 35 35 44 36 36 36 34
##
## $gen
## gen
## G01 G02 G03 G04 G05 G06 G07 G08 G09 G10
## 29 29 29 29 22 29 23 18 24 24
# with(theobald.covariate, table(gen, env, Year)) # lots of useful output
The treatments are not fully crossed, so a fully specified model of the form yield ~ Year*env*gen*chu cannot be tested. The treatments and interactions were tested in reduced models and compared. The final “best” model is shown below.
# the covariate, chu, is added in like any other effect.
theobald.lm2 <- lm(yield ~ Year + env*chu, data = theobald.covariate)
Anova(theobald.lm2, type = "III")
## Anova Table (Type III tests)
##
## Response: yield
## Sum Sq Df F value Pr(>F)
## (Intercept) 4.309 1 6.8321 0.009524 **
## Year 76.589 4 30.3607 < 2.2e-16 ***
## env 13.473 6 3.5607 0.002138 **
## chu 11.831 1 18.7596 2.187e-05 ***
## env:chu 13.376 6 3.5350 0.002268 **
## Residuals 150.096 238
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# how to extract the covariate slope(s):
emtrends(theobald.lm2, ~ env, "chu")
## env chu.trend SE df lower.CL upper.CL
## E1 7.014988 1.619627 238 3.824352 10.205624
## E2 0.978518 4.437201 238 -7.762686 9.719722
## E3 4.099310 3.152818 238 -2.111683 10.310303
## E4 -2.884280 3.543760 238 -9.865421 4.096861
## E5 8.222358 2.700385 238 2.902649 13.542066
## E6 3.424701 2.720563 238 -1.934759 8.784161
## E7 -0.358948 2.548834 238 -5.380105 4.662209
##
## Results are averaged over the levels of: Year
## Confidence level used: 0.95
# emmeans extracted as usual:
emmeans(theobald.lm2, ~ env)
## NOTE: Results may be misleading due to involvement in interactions
## env emmean SE df lower.CL upper.CL
## E1 6.666544 0.1749185 238 6.321958 7.011131
## E2 5.134968 0.2563050 238 4.630052 5.639884
## E3 6.658226 0.4819422 238 5.708809 7.607644
## E4 7.224342 0.5075654 238 6.224447 8.224236
## E5 6.608382 0.1384270 238 6.335684 6.881081
## E6 6.430967 0.2359627 238 5.966125 6.895809
## E7 6.318851 0.3972573 238 5.536262 7.101440
##
## Results are averaged over the levels of: Year
## Confidence level used: 0.95
emmeans(theobald.lm2, ~ Year)
## Year emmean SE df lower.CL upper.CL
## 1990 6.969702 0.1885709 238 6.598221 7.341183
## 1991 6.746088 0.1697058 238 6.411771 7.080405
## 1992 7.070521 0.1872254 238 6.701690 7.439351
## 1993 5.390109 0.2080069 238 4.980339 5.799878
## 1994 5.996638 0.2183068 238 5.566578 6.426699
##
## Results are averaged over the levels of: env
## Confidence level used: 0.95
Load data Load “Oats” from nlme. Nitrogen level (“nitro”) is the main plot, cultivar (“Variety”) is the sub-plot and “Block” describes the blocking design.
data(Oats)
str(Oats)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 72 obs. of 4 variables:
## $ Block : Ord.factor w/ 6 levels "VI"<"V"<"III"<..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Variety: Factor w/ 3 levels "Golden Rain",..: 3 3 3 3 1 1 1 1 2 2 ...
## $ nitro : num 0 0.2 0.4 0.6 0 0.2 0.4 0.6 0 0.2 ...
## $ yield : num 111 130 157 174 117 114 161 141 105 140 ...
## - attr(*, "formula")=Class 'formula' language yield ~ nitro | Block
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## - attr(*, "labels")=List of 2
## ..$ y: chr "Yield"
## ..$ x: chr "Nitrogen concentration"
## - attr(*, "units")=List of 2
## ..$ y: chr "(bushels/acre)"
## ..$ x: chr "(cwt/acre)"
## - attr(*, "inner")=Class 'formula' language ~Variety
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
na_func(Oats)
## Block Variety nitro yield
## 0 0 0 0
Oats$N <- as.factor(Oats$nitro)
replications(yield ~ Variety*N*Block, data = Oats)
## Variety N Block Variety:N
## 24 18 12 6
## Variety:Block N:Block Variety:N:Block
## 4 3 1
table(Oats$Variety, Oats$N)
##
## 0 0.2 0.4 0.6
## Golden Rain 6 6 6 6
## Marvellous 6 6 6 6
## Victory 6 6 6 6
hist(Oats$yield, col = "gold")
boxplot(yield ~ N, data = Oats, col = "dodgerblue1")
boxplot(yield ~ Variety, data = Oats, col = "red3")
Analyse data:
The format for specifying split-plot error terms is Error(blocking factor/main plot).
splot.oats <- aov(yield ~ Variety*nitro + Error(Block/Variety), data = Oats)
summary(splot.oats)
##
## Error: Block
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 5 15875 3175
##
## Error: Block:Variety
## Df Sum Sq Mean Sq F value Pr(>F)
## Variety 2 1786 893.2 1.485 0.272
## Residuals 10 6013 601.3
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## nitro 1 19536 19536 115.771 1e-14 ***
## Variety:nitro 2 168 84 0.499 0.61
## Residuals 51 8606 169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emtrends(splot.oats, ~ Variety, "nitro") # slopes
## Note: re-fitting model with sum-to-zero contrasts
## Note: re-fitting model with sum-to-zero contrasts
## Variety nitro.trend SE df lower.CL upper.CL
## Golden Rain 75.33333 11.85854 51 51.52632 99.14035
## Marvellous 64.58333 11.85854 51 40.77632 88.39035
## Victory 81.08333 11.85854 51 57.27632 104.89035
##
## Confidence level used: 0.95
emmeans(splot.oats, ~ Variety) # emmeans
## Note: re-fitting model with sum-to-zero contrasts
## NOTE: Results may be misleading due to involvement in interactions
## Variety emmean SE df lower.CL upper.CL
## Golden Rain 127.1000 8.570757 12.85 108.5618 145.6382
## Marvellous 129.1667 8.570757 12.85 110.6285 147.7048
## Victory 121.9500 8.570757 12.85 103.4118 140.4882
##
## Confidence level used: 0.95
for extracting model parameters, diagnostics and other model information
These work differently with different R object types. That is, different output will result depending on if a “lm”, “lme” or “merMod” (lmer) object is used in the function call.
# extract model summary
summary()
#extract coefficients:
coef()
#extract residuals
resid()
rstudent()
residuals()
# extract predicted values
fits()
# make diagnostic plots
plot()
# extract influence measures:
influence.measures()
#other fir diagnostics:
cooks.distance()
dffits()
dfbeta()
hat()
To see the all functions available for a particular type of linear model object, use:
methods(class = "lm") # for lm objects
## [1] add1 alias anova
## [4] Anova avPlot Boot
## [7] bootCase boxCox brief
## [10] case.names ceresPlot coerce
## [13] confidenceEllipse confint Confint
## [16] cooks.distance crPlot deltaMethod
## [19] deviance dfbeta dfbetaPlots
## [22] dfbetas dfbetasPlots drop1
## [25] dummy.coef durbinWatsonTest effects
## [28] emm_basis extractAIC family
## [31] formula hatvalues hccm
## [34] infIndexPlot influence influencePlot
## [37] initialize inverseResponsePlot kappa
## [40] labels leveneTest leveragePlot
## [43] linearHypothesis logLik mcPlot
## [46] mmp model.frame model.matrix
## [49] ncvTest nextBoot nobs
## [52] outlierTest plot powerTransform
## [55] predict Predict print
## [58] proj qqnorm qqPlot
## [61] qr recover_data residualPlot
## [64] residualPlots residuals rstandard
## [67] rstudent S show
## [70] sigmaHat simulate slotsFromS3
## [73] spreadLevelPlot summary variable.names
## [76] vcov
## see '?methods' for accessing help and source code
methods(class = "lme") # for lme objects
## [1] ACF anova Anova augPred
## [5] coef comparePred confint Confint
## [9] deltaMethod deviance emm_basis extractAIC
## [13] fitted fixef formula getData
## [17] getGroups getGroupsFormula getResponse getVarCov
## [21] influence intervals linearHypothesis logLik
## [25] matchCoefs nobs pairs plot
## [29] predict print qqnorm ranef
## [33] recover_data residuals S sigma
## [37] simulate summary update VarCorr
## [41] Variogram vcov
## see '?methods' for accessing help and source code
methods(class = "merMod") # for lmer objects
## [1] anova Anova as.function coef
## [5] confint cooks.distance deltaMethod deviance
## [9] df.residual drop1 emm_basis extractAIC
## [13] family fitted fixef formula
## [17] getL getME hatvalues influence
## [21] isGLMM isLMM isNLMM isREML
## [25] linearHypothesis logLik matchCoefs model.frame
## [29] model.matrix ngrps nobs plot
## [33] predict print profile ranef
## [37] recover_data refit refitML rePCA
## [41] residuals rstudent show sigma
## [45] simulate summary terms update
## [49] VarCorr vcov vif weights
## see '?methods' for accessing help and source code
The package “emmeans” also supports a large number of models.